Abstract

The measurement of treatment integrity is critical to evaluate the efficacy and effectiveness of evidence-based programs 
(EBPs) designed to improve the developmental outcomes of young children at risk of emotional/behavioral disorders. 
Unfortunately, the science of treatment integrity measurement lags behind the development and evaluation of EBP for 
young, high-risk children. This article describes the development and preliminary psychometric properties of the BEST in 
CLASS Adherence and Competence Scale (BiCACS), designed to measure the adherence and competence of delivery of 
the BEST in CLASS prevention program. Independent observers coded videotaped (n = 116) and live (n = 289) observations 
of teachers delivering the BEST in CLASS program. The BiCACS showed good interrater reliability, and analyses provided 
some support for the validity of the measure. Implications for future research and integrity measurement work are 
discussed.
Many young children who attend early childhood programs 
display high levels of problem behaviors. The severity and 
intensity of these children’s problem behaviors negatively 
impact their learning (Driscoll & Pianta, 2010; Quesenberry, 
Hemmeter, & Ostrosky, 2011) and their teachers find them 
difficult to manage (Hemmeter, Corso, & Cheatham, 2006). 
Fortunately, there are evidence-based programs (EBPs) 
available to address the needs of young children in early 
childhood classrooms who are at elevated risk for the devel- 
opment of emotional and behavioral disorders (EBD). For 
example, Incredible Years (Webster-Stratton, Reid, & 
Hammond, 2004) and Preschool PATHS (Bierman et al., 
2008; Domitrovich, Cortes, & Greenberg, 2007) have pro- 
duced promising outcomes in multiple randomized trials. 
However, it is a challenge to transport and implement EBPs 
in authentic early childhood classrooms (Domitrovich, 
Moore, & Greenberg, 2012). One reason the field struggles 
with EBP implementation is that most programs lack 
comprehensive professional development tools and procedures 
for training and supporting early childhood educators in their use. 
Carroll and Nuro (2002) identify four elements that are 
required to develop EBPs and evaluate their effectiveness: 
(a) a standardized treatment model (e.g., treatment manual); 
(b) a well-defined target population; (c) documented and 
standardized procedures for selecting, training, and 
supervising interventionists; and (d) tools to monitor treatment 
integrity. Each element is designed to help investigators 
interpret study findings as well as aid the transportability 
(i.e., scale-up) of EBPs. Although some existing EBPs (e.g., 
Incredible Years; Preschool PATHS) meet three of the four 
requisites needed for this work (i.e., treatment manuals, tar- 
get population, training protocols), most EBPs lack vali- 
dated treatment integrity measures designed to support 
program evaluation and teacher training (Hagermoser 
Sanetti, Dobey, & Gritter, 2012; Schulte, Easton, & Parker, 
2009). 

Treatment integrity refers to the degree to which an EBP 
was delivered as intended. When developing and evaluat- 
ing an EBP, it is important to develop tools to assess two 
components of treatment integrity (Carroll & Nuro, 2002): 
treatment adherence and competence. Treatment adherence 
refers to the extent to which an EBP is delivered as 
designed (i.e., delivery of prescribed interventions), 
whereas competence refers to the level of skill and degree 
of responsiveness demonstrated by a teacher when delivering 
the prescribed interventions. Assessing these integrity 
components allows researchers to answer key questions 
related to internal validity needed to support program eval- 
uation (Carroll et al., 2000). Indeed, assessing adherence 
and competence allows researchers to establish whether (a) 
a program was delivered as designed, and (b) there was 
variation in delivery across teachers, classrooms, and/or 
schools. Treatment integrity tools also allow researchers to 
engage in process-outcome analyses that can help optimize 
the impact and delivery of an EBP (Carroll et al., 2000; 
Durlak, 2010). By investigating linkages between interven- 
tion components and outcomes, important questions can be 
asked: Were particular intervention components linked to 
outcomes? Do various children, classrooms, and/or set- 
tings require different intervention components to maxi- 
mize outcomes? Psychometrically strong integrity 
measures can therefore play an important role in the devel- 
opment and evaluation of EBPs. 

Integrity measures also play an important role in efforts 
to establish and maintain treatment integrity, which is 
integral to implementation research (Southam-Gerow & 
McLeod, 2013). In implementation research, treatment 
integrity is established and maintained via teacher training 
and coaching, and integrity measurement plays a 
significant role in this process. Treatment integrity 
measures can be used to (a) establish adherence and competence 
benchmarks used to guide teacher training efforts, and 
(b) assess the outcome of teacher training and coaching 
efforts. Integrity measures also play an important role in 
interpreting findings in implementation research by deter- 
mining whether EBPs are implemented as designed 
(McLeod, Southam-Gerow, Bair, Rodriguez, & Smith, 
2013). For these reasons, integrity measures are considered 
to be critical for implementation research. 

The purpose of this article is to describe our attempt to 
develop and validate an integrity measure designed to sup- 
port program evaluation and teacher training for a program 
targeting the reduction of young, high-risk children’s prob- 
lem behavior. Specifically, we describe the development of 
the BEST in CLASS Adherence and Competence Scale 
(BiCACS; Sutherland & McLeod, 2010). The BiCACS is 
an observational treatment integrity measure designed to 
support the development and evaluation of the BEST in 
CLASS program. BEST in CLASS is a theoretically driven 
program based on evidence-based instructional practices 
that target problem behaviors of young children at high risk 
for the development of EBD (Conroy, Sutherland, Vo, Carr, 
& Ogston, 2013; Sutherland, Conroy, Abrams, & Vo, 2010; 
Vo, Sutherland, & Conroy, 2012). Conceptualized as a 
“value-added” model, BEST in CLASS is designed to 
increase the quantity and quality of specific instructional 
practices that have been demonstrated to prevent and reduce 
the occurrence of young children’s problem behaviors. Two 
pilot investigations of the BEST in CLASS program have 
provided promising initial data on the model (Conroy et al., 
2013; Vo et al., 2012). 

In this report, we describe the development and report on 
the psychometric properties of the BiCACS. We first 
describe the development of the BiCACS. Then, we report 
data from two studies designed to evaluate the psychomet- 
ric properties of the BiCACS relevant to using the measure 
for program implementation and evaluation. First, we eval- 
uate whether trained coders can reliably code BiCACS 
items using videotaped recordings of the implementation of 
BEST in CLASS. Second, we evaluate whether coaches can 
reliably code the BiCACS items. This information is impor- 
tant for program implementation and refinement because 
coaches are commonly used to establish and maintain treat- 
ment integrity in school-based trials. We therefore wanted 
to evaluate whether coaches could achieve adequate reli- 
ability on the BiCACS items under normal training condi- 
tions. We also examined the potential for the BiCACS to 
inform evaluation efforts by assessing the validity of the 
measure. Specifically, we examined the construct validity 
of the measure by examining its sensitivity to change over 
time as well as its relation to a measure of teacher-child 
relationships. 


Method 
BEST in CLASS 


The BiCACS was developed in the context of an Institute of 
Education Sciences Development and Innovation Goal 2 
project (see Vo et al., 2012). All data reported below were 
pulled from this parent project. BEST in CLASS, conceptual- 
ized as a Tier-2 program, is a manualized classroom-based 
program that targets the reduction of problem behaviors dem- 
onstrated by young children at risk of EBD. Teachers are 
trained to deliver the BEST in CLASS program via a 6-hr 
professional development workshop and 14 weeks of perfor- 
mance-based coaching provided by trained coaches (see Vo 
et al., 2012 for a description of the program development pro- 
cess). Training and coaching focus on eight learning mod- 
ules: (a) Basics of Behavior and Development; (b) Rules, 
Expectations, and Routines; (c) Behavior-Specific Praise; 
(d) Precorrection and Active Supervision; (e) Opportunities 
to Respond and Instructional Pacing; (f) Instructive and 
Corrective Feedback; (g) Home-School Communication; 
and (h) Linking and Mastery. While the efficacy of BEST in 
CLASS is currently being investigated in a multisite random- 
ized controlled trial, preliminary data from the Goal 2 project 
were promising. Specifically, two pilot, nonexperimental 
investigations of BEST in CLASS suggest that the program 
had an impact on observed teacher instructional and child 
behaviors (Conroy et al., 2013) as well as standardized mea- 
sures of child behavior (Vo et al., 2012). 
Table 1. BiCACS Adherence and Competence Items. 

1. Teacher reviews rules, addresses rule violations: teacher statement that includes classroom rule. 
2. Teacher uses clear routines (within and between activities): teacher uses procedures and activities to provide structure. 
3. Teacher maintains brisk instructional pace: rate at which teacher provides instruction. 
4. Teacher provides precorrection: instruction or prompt to remind child of appropriate behavior. 
5. Teacher uses proximity control and visual monitoring: teacher visually monitors child and positions herself in close proximity for child exhibiting problem behavior. 
6. Teacher provides preacademic OTR: question, prompt or signal by teacher that seeks an active, observable, and specific child preacademic response. 
7. Teacher provides social/behavioral OTR: question, prompt or signal by teacher that seeks an active, observable, and specific child social/behavioral response. 
8. Teacher provides behavior specific praise: verbal approval statement that tells child the specific behavior for which they are being praised. 
9. Teacher provides corrective feedback: specific information provided to child after error occurs. 
10. Teacher provides instructive feedback: teacher statement that provides extra instructional information when responding to child’s correct response or appropriate behavior. 

Note. BiCACS = BEST in CLASS Adherence and Competence Scale; OTR = opportunities to respond. 


Development of the BiCACS 


The BiCACS is a 20-item scale designed to assess the 
adherence and competence of the core BEST in CLASS 
program components. The BiCACS was developed via a 
three-step process. 


Step 1: Item development. Our first step was to develop 
items for the two BiCACS subscales, Adherence and 
Competence. First, we identified all prescribed content 
from the BEST in CLASS treatment manual that repre- 
sented the core BEST in CLASS interventions (see Vo et al., 
2012). Second, the developers of the BEST in CLASS pro- 
gram reviewed the list of items to ensure that all interven- 
tions essential to the theory underlying the BEST in CLASS 
program were included (see Vo et al., 2012). The program 
developers provided feedback about item content, including 
suggestions for additional items. The resulting item pool 
was checked against the treatment manual. In all, this 
process generated 10 items for each of the Adherence and 
Competence subscales (see Table 1). 


Step 2: Scoring strategy. Next, we determined the appropri- 
ate scoring strategy for the BiCACS subscales. For the 
Adherence subscale, we used a scoring strategy based on 
past treatment integrity research in the child psychother- 
apy literature (e.g., Hogue, Liddle, & Rowe, 1996; 
McLeod & Weisz, 2010) that involves macroanalytic 
extensiveness ratings. This scoring strategy requires cod- 
ers to estimate the extent to which teachers engage in each 
intervention during an observation using a 7-point 
Likert-type scale with the following anchors: 1 = not at all, 3 = 
somewhat, 5 = considerably, and 7 = extensively. 
Extensiveness ratings comprise two key components: 
thoroughness and frequency. Thoroughness refers to the 
depth, complexity, or persistence with which the teacher 
engages in a given intervention. Frequency refers to the 
number of times throughout an observation that a given 
intervention is executed (regardless of the thoroughness of 
the intervention in any particular segment). Thoroughness 
and frequency are considered in making an extensiveness 
rating on each item; therefore, extensiveness ratings 
provide quantity, or dosage, information about each BEST in 
CLASS intervention. 

For the Competence subscale, we adopted a scoring strat- 
egy that involves macroanalytic competence ratings that 
estimate the technical quality of interventions (skillfulness) 
and their timing and appropriateness for the given child and 
situation (responsiveness). This scoring strategy is used in 
exemplar competence coding systems developed for youth 
(Hogue et al., 2008) and adult (Carroll et al., 2000) psycho- 
therapy. In assessing competence, coders are asked to make 
ratings on a 7-point Likert-type scale with the following 
anchors: 1 = very poor; 3 = acceptable; 5 = good; 7 = excel- 
lent. For each item, coders are asked to consider the extent to 
which a teacher demonstrated the following dimensions: 
(a) expertise, commitment, motivation; (b) clarity of lan- 
guage; (c) appropriate timing of interventions and actions 
(responsiveness); and (d) ability to read and respond to 
where the child appears to be (responsiveness). 


Step 3: Scoring manual. Once the items were developed, a 
draft of the scoring manual was produced. The scoring man- 
ual was intended to promote interrater reliability by provid- 
ing coders with clear scoring procedures, item definitions, 
exemplars, and item distinctions for the Adherence and 
Competence subscales (see Hogue et al., 1996). Coders 
used the scoring manual to code videotaped sessions of the 
BEST in CLASS program being delivered by teachers in 
early childhood classrooms. Feedback from the pilot coding 
was used to produce a revised version of the BiCACS scor- 
ing manual. 
Study 1 


It is important for treatment integrity measures to demon- 
strate reliability at the item level to support program evalu- 
ation and inform efforts to optimize the impact and delivery 
of an EBP. The first study was therefore conducted to evalu- 
ate the initial reliability of the BiCACS items and subscales. 
Trained coders rated recordings of teachers delivering the 
BEST in CLASS program in early childhood classrooms. 


Participants 


Child participants. In Years 2 and 3 of the development proj- 
ect, the BEST in CLASS program was implemented in 25 
state- or federally funded classrooms serving high-risk chil- 
dren with 47 focal children within one suburban and one 
urban school district on the east coast. As part of program 
development, videotaped observations were collected in 19 
classrooms (n = 32 child participants). These recordings 
were used in Study 1 to examine the initial reliability of the 
BiCACS. 

Multiple measures were used to screen child participants 
for inclusion in the study. All focal children were between 3 
and 5 years of age and enrolled in state and federally funded 
early childhood programs designed to provide services for 
children at elevated risk. The first two stages of the Early 
Screening Project (ESP; Walker, Severson, & Feil, 1995) 
were used to identify potential focal children. In the first 
stage, teachers nominated up to five children in their class- 
rooms who demonstrated the most severe and chronic prob- 
lem behaviors, and consent from parents or guardians of all 
nominated child participants was sought. Next, to confirm 
risk for EBD, teachers completed the Externalizer 
Questionnaire of the ESP on each of the child participants 
for whom consent was obtained. Children were then 
assessed with the Battelle Developmental Inventory, second 
edition screener (BDI II Screener; Newborg, 2005) and if 
children demonstrated average or above average cognitive/ 
intellectual abilities, they were retained in the sample. 
Following this screening process, children with the top two 
most extreme scores on the ESP were included in the sam- 
ple as the focal children. 

The 32 children (24 males and 7 females; data missing 
for one child) in Study 1 all qualified for free and reduced 
lunch and averaged 3.97 years of age (SD = 0.32; range 
from 3-5). Of the children, 25 were African American, 2 
were Caucasian, 1 was Latino, 1 was Asian/Pacific Islander, 
and 1 was Other (biracial); data were missing for two 
children. All children scored as “at risk” for future development 
of EBD on the ESP and scored within the normal range on 
the BDI II Total Screening Score (Newborg, 2005). 


Teacher participants. Of the 19 teachers (1 male and 18 
females) who volunteered to participate (6 had a bachelor’s 
degree and 13 had a master’s degree), 13 were Caucasian, 5 
were African American, and 1 was Latina. They averaged 
9.84 years of experience working with preschool-aged chil- 
dren (SD = 9.77; range 0-34 years). 


Coders. The coding team consisted of two research assis- 
tants who were both Caucasian females. One coder was 
completing her BA in psychology, and one was completing 
her master’s degree in school counseling. 


BiCACS Scoring and Session Sampling 
Procedures 


Recordings (n = 116) from Years 2 and 3 of the BEST in 
CLASS development project were selected for coding. 
These recordings were collected as part of the develop- 
ment work to examine teachers’ use of BEST in CLASS 
interventions with focal children during instructional 
activities. The sessions occurred during teacher-led 
instructional activities (e.g., circle time, small group), 
ranged in length from 3 min and 19 s to 16 min and 18 s 
(M = 13.62; SD = 3.40), and represented times when the 
teacher and focal child were in the classroom. Session 
length was determined a priori and was based on several 
factors. First, indicators of high quality early childhood 
programs suggest that teacher-led activities should be rel- 
atively brief (e.g., 10-20 min; Cate, Diefendorf, 
McCullough, Peters, & Whaley, 2010). In addition, behav- 
ioral observers have found that given an adequate base 
rate of responses, a 10- to 20-min observation typically 
results in a representative sample of behavior (Thompson, 
Felce, & Symons, 2000). The coders trained over a 
2-month period to reach adequate prestudy reliability on 
all BiCACS items (intraclass correlation coefficient [ICC] 
> .59; Cicchetti, 1994). Training consisted of reading the 
scoring manual, review of specific observation segments, 
and practice scoring of observations. Once scoring com- 
menced, the observations were randomly assigned to cod- 
ers and regular reliability assessments were performed. 
The results of these assessments were discussed in weekly 
meetings with the first two authors to prevent coder drift 
(Margolin et al., 1998). 


Results 


We first evaluated whether the Study 1 sample was 
comparable with the original sample. The Study 1 sample of children 
(n = 32) did not differ significantly from the original sample 
(n = 47) in demographic characteristics (e.g., age, gender, 
race/ethnicity). Similarly, the Study 1 sample of teachers (n = 
19) did not differ significantly from the original sample (n = 
25) in demographic or training characteristics (e.g., age, 
gender, race/ethnicity, degree, years of experience). 
Table 2. Item Scores and Interrater Reliability of the BiCACS: Adherence Subscale. 

                          Study 1                               Study 2
Item description          M (SD)       Minimum  Maximum  ICC    M (SD)       Minimum  Maximum  ICC
Rules                     2.55 (2.27)  1        7        .92    3.63 (2.28)  1        7        .90
Clear routines            5.43 (1.83)  1        7        .80    5.01 (1.32)  1        7        .44
Brisk instructional pace  6.18 (1.40)  1        7        .88    4.99 (1.42)  1        7        .59
Precorrection             3.72 (2.11)  1        7        .77    4.45 (2.33)  1        7        .87
Proximity control         6.56 (1.03)  2        7        .73    6.18 (1.25)  1        7        .77
Preacademic OTR           6.57 (1.26)  1        7        .82    5.89 (1.81)  1        7        .63
Social OTR                1.80 (1.19)  1        7        .66    6.07 (1.45)  1        7        .67
Behavior specific praise  1.93 (1.29)  1        7        .81    4.08 (2.36)  1        7        .91
Corrective feedback       3.78 (2.84)  1        7        .76    3.15 (1.77)  1        7        .70
Instructive feedback      1.41 (0.89)  1        7        .64    2.69 (1.72)  1        7        .74

Note. BiCACS = BEST in CLASS Adherence and Competence Scale; ICC = intraclass correlation coefficient; OTR = opportunities to respond. 


Table 3. Item Scores and Interrater Reliability of the BiCACS: Competence Subscale. 

                          Study 1                               Study 2
Item description          M (SD)       Minimum  Maximum  ICC    M (SD)       Minimum  Maximum  ICC
Rules                     6.43 (0.91)  4        7        .95    5.20 (1.59)  1        7        .84
Clear routines            6.40 (0.91)  3        7        .81    5.14 (1.34)  1        7        .42
Brisk instructional pace  6.59 (0.71)  4        7        .90    5.13 (1.40)  1        7        .67
Precorrection             6.18 (1.01)  2        7        .72    5.20 (1.42)  1        7        .76
Proximity control         6.67 (0.80)  3        7        .69    6.01 (1.17)  1        7        .39
Preacademic OTR           6.74 (0.61)  4        7        .72    5.65 (1.19)  2        7        .55
Social OTR                5.57 (1.18)  2        7        .56    5.62 (1.14)  2        7        .63
Behavior specific praise  6.08 (0.87)  4        7        .87    5.08 (1.47)  1        7        .85
Corrective feedback       6.11 (0.82)  4        7        .64    4.69 (1.38)  1        7        .68
Instructive feedback      5.33 (1.15)  4        7        .51    4.61 (1.38)  1        7        .65

Note. BiCACS = BEST in CLASS Adherence and Competence Scale; ICC = intraclass correlation coefficient; OTR = opportunities to respond. 


To determine if coders could reliably code items on the 
BiCACS, interrater reliability was calculated using ICC 
(Shrout & Fleiss, 1979; see Tables 2 and 3). The ICC provides 
an estimate of the ratio of the true score variance to total vari- 
ance. These correlations therefore provide a reliability esti- 
mate that allows for generalizability of the results to other 
samples. Following Cicchetti (1994), ICCs less than .40 
reflect “poor” agreement, ICCs from .40 to .59 reflect “fair” 
agreement, ICCs from .60 to .74 reflect “good” agreement, 
and ICCs of .75 and higher reflect “excellent” agreement. 
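
The ICC calculation and Cicchetti's (1994) bands can be illustrated with a minimal sketch. The article does not state which Shrout and Fleiss (1979) model was used, so the two-way random-effects, absolute-agreement forms, ICC(2,1) and ICC(2,k), are assumed here purely for illustration. The function names are hypothetical, and the ratings are the six-target, four-judge example from Shrout and Fleiss, not BiCACS data.

```python
from statistics import mean


def icc_two_way_random(ratings):
    """Return (ICC(2,1), ICC(2,k)) for a targets-by-raters table of scores.

    Assumed model: two-way random effects, absolute agreement
    (Shrout & Fleiss, 1979). Every target must be scored by every rater.
    """
    n = len(ratings)       # number of rated targets (e.g., observations)
    k = len(ratings[0])    # number of raters
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Two-way ANOVA sums of squares: targets, raters, and residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    single = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
    average = (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)
    return single, average


def cicchetti_band(icc):
    """Label an ICC (rounded to two decimals) using Cicchetti's (1994) bands."""
    icc = round(icc, 2)
    if icc < 0.40:
        return "poor"
    if icc < 0.60:
        return "fair"
    if icc < 0.75:
        return "good"
    return "excellent"


# Worked example from Shrout and Fleiss (1979): 6 targets rated by 4 judges.
SHROUT_FLEISS_EXAMPLE = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]

single, average = icc_two_way_random(SHROUT_FLEISS_EXAMPLE)
# For these data ICC(2,1) is about .29 and ICC(2,k) is about .62.
print(round(single, 2), cicchetti_band(single))
print(round(average, 2), cicchetti_band(average))
```

Applied to the item-level values in Tables 2 and 3, `cicchetti_band` reproduces the labels used in the text, for example "good" for .64 and "excellent" for .92.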

Interrater reliability was first calculated for each item on 
the Adherence and Competence subscales. According to 
Cicchetti’s (1994) criteria, interrater reliability for the 
Adherence items ranged from “good” to “excellent” (ICCs 
ranged from .64 to .92 [M = .78, SD = .09], see Table 2), with 
7 of the 10 items in the “excellent” range (ICC > .74), and 
three in the “good” range. The Adherence subscale was highly 
reliable (ICC = .93). The Competence items ranged from 
“fair” to “excellent” (ICCs ranged from .51 to .95 [M = .74, 
SD = .15], see Table 3), with 4 of the 10 items in the 
“excellent” range, 4 in the “good” range, and the remaining 2 items 
in the “fair” range (see Cicchetti, 1994). The Competence 
subscale was also highly reliable (ICC = .84). In sum, the 
Adherence and Competence subscales along with the items 
comprising the subscales demonstrated adequate reliability. 

Theoretical and empirical work has thus far not clarified 
the amount of overlap between the adherence and compe- 
tence integrity components (see Barber, Sharpless, 
Klostermann, & McCarthy, 2007), so we examined the 
degree of overlap between the Adherence and Competence 
subscales. Subscale scores were produced by first calculating, 
for each case, the mean score across all observations on each 
of the 10 BiCACS items of a subscale and then averaging those 
item means. The Adherence and Competence subscales 
evidenced moderate overlap (r = .43, p < .001). These findings 
suggest that the BiCACS Adherence and Competence 
subscales overlap only moderately and therefore measure 
distinct content. 
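
The subscale scoring and overlap analysis described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the ratings are invented, and only three items per subscale are shown for brevity (each BiCACS subscale has 10).

```python
from math import sqrt
from statistics import mean


def subscale_score(observations):
    """Average each item over a case's observations, then average the item means."""
    n_items = len(observations[0])
    item_means = [mean(obs[i] for obs in observations) for i in range(n_items)]
    return mean(item_means)


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Invented 1-7 ratings: one list of item scores per observation, grouped by case.
adherence_by_case = {
    "case_a": [[2, 5, 6], [4, 5, 7]],
    "case_b": [[1, 3, 4], [1, 4, 5], [2, 3, 3]],
    "case_c": [[5, 6, 7], [6, 6, 7]],
    "case_d": [[3, 2, 5], [3, 4, 4]],
}
competence_by_case = {
    "case_a": [[6, 6, 7], [6, 7, 7]],
    "case_b": [[5, 6, 4], [6, 5, 5], [5, 5, 6]],
    "case_c": [[6, 7, 7], [7, 7, 6]],
    "case_d": [[6, 6, 6], [7, 5, 6]],
}

cases = sorted(adherence_by_case)
adherence = [subscale_score(adherence_by_case[c]) for c in cases]
competence = [subscale_score(competence_by_case[c]) for c in cases]
print(round(pearson_r(adherence, competence), 2))
```

Averaging within each item before averaging across items keeps cases with many observations from carrying more weight than cases with few.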
Study 2 


The second study was conducted to evaluate whether 
coaches could reliably code each BiCACS item under typi- 
cal training conditions. Coaches involved in training the 
teachers produced scores on the BiCACS items following 
live observations of teachers delivering the BEST in CLASS 
program in early childhood classrooms. For treatment integ- 
rity measures to contribute to efforts to establish and main- 
tain treatment integrity, it is important to demonstrate 
reliability at the item level as well as demonstrate sensitivity 
to changes in treatment integrity over the course of program 
delivery. Study 2 was therefore designed to evaluate the pre- 
liminary reliability and construct validity of the BiCACS. 


Participants 


Child participants. In Year 3 of the development project, the 
BEST in CLASS program was implemented with 23 focal 
children in 11 state or federally funded classrooms serving 
high-risk children within one suburban and one urban dis- 
trict on the east coast (see above for procedure details). The 
23 children (15 males and 8 females) in Study 2 averaged 
3.95 years of age (SD = 0.38; range from 3 to 5). Child 
participants included 16 African American, 2 Caucasian, 1 
Asian/Pacific Islander, and 1 Latino (data were not 
provided for 3 children). All children were enrolled in state or 
federally funded early childhood programs, qualified for 
free and reduced lunch, scored as “at risk” for future devel- 
opment of EBD as indicated by the ESP, and scored within 
the normal range on the BDI II. 


Teacher participants. Of the 11 female teachers (3 had a 
bachelor’s degree and 8 had a master’s degree), 5 were 
African American, 5 were Caucasian, and 1 was Latina. They 
averaged 9.27 years of experience with preschool-aged 
children (SD = 10.31; range 0-34 years). 


Coders. The primary coding team consisted of six coaches 
(five females). Five coaches were Caucasian and one was 
Latina. All coaches had completed their bachelor’s degree, 
and four had completed a master’s degree. Secondary cod- 
ers used for reliability analyses consisted of four data col- 
lectors (three of whom were also coaches). Three secondary 
coders were Caucasian and one was Latina. All secondary 
coders had completed their bachelor’s degree, and two had 
completed a master’s degree. 


BiCACS Scoring and Session Sampling 
Procedures 

Live observations of the BEST in CLASS program were 
conducted weekly by coaches during their regular coaching 
sessions with teachers across three phases of the program 
(i.e., baseline, treatment implementation, maintenance). A 
total of 289 observations were conducted of which 54 
(18.68%) were independently coded by a second observer 
for reliability purposes. During these observations, the 
coaches observed 15 min of instructional time during 
teacher-led activities and then completed the BiCACS. 
Coders’ training on the BiCACS consisted of reading the 
scoring manual and a 2-hr didactic training session. 


Measure for Validity Analyses 


The Student Teacher Relationship Scale (STRS; Pianta & 
Hamre, 2001) assesses teacher perceptions of relationships 
with children and was used in Study 2 to assess the validity 
of the BiCACS. The STRS consists of 15 items measured 
on a 5-point Likert-type scale, where 1 represents definitely 
does not apply and 5 represents definitely applies. Two sub- 
scales, Closeness and Conflict, are derived from the STRS. 
For the current sample, the internal consistency for both 
factors was acceptable (Cronbach’s α = .78 and .89 for 
closeness and conflict, respectively). The STRS has demon- 
strated validity with regard to predicting academic and 
social functioning in prekindergarten through the elemen- 
tary grades (Hamre & Pianta, 2001; Pianta, La Paro, Payne, 
Cox, & Bradley, 2002) and has been used extensively in 
studies of preschool and elementary-age children (e.g., 
Birch & Ladd, 1997, 1998; Howes & Hamilton, 1992; 
Howes & Ritchie, 1999). The STRS has been validated with 
low-income and minority samples (Hamre & Pianta, 2001). 
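
For readers unfamiliar with the internal consistency statistic reported above, Cronbach's α can be computed from item-level responses as in the sketch below. The function name and the 5-point responses are hypothetical and chosen only for illustration; they are not STRS data, and the number of items shown does not reflect the actual STRS subscales.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a table with one row per respondent and one column per item."""
    k = len(responses[0])  # number of items
    # Sample variance of each item across respondents.
    item_vars = [variance([row[j] for row in responses]) for j in range(k)]
    # Sample variance of the respondents' total scores.
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Invented 5-point ratings from five respondents on a three-item scale.
example_responses = [
    [4, 5, 4],
    [2, 3, 3],
    [5, 5, 4],
    [3, 3, 2],
    [1, 2, 2],
]
print(round(cronbach_alpha(example_responses), 2))
```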


Results 


To determine whether coaches could reliably code items on 
the BiCACS under typical training conditions, interrater 
reliability for data collected during live observations was 
calculated using ICCs (Shrout & Fleiss, 1979; see Tables 2 
and 3). A coach and a secondary observer scored 18.68% of 
the sessions (n = 54) for reliability. Because a single coach 
coded all observations and reliability with another coder 
was calculated on a subset of the observations, the appropri- 
ate ICC estimate was single rater (McLeod, Islam, & Wheat, 
2013). Interrater reliability for the Adherence items ranged 
from “fair” to “excellent” (ICCs ranged from .44 to .91 
[M = .72, SD = .15], see Table 2), with four of the items in 
the “excellent” range, four in the “good” range, and two in 
the “fair” range. The Adherence subscale was highly reli- 
able (ICC = .90). The interrater reliability for the 
Competence items ranged from “poor” to “excellent” (ICCs 
ranged from .39 to .85 [M = .64, SD = .16], see Table 3), 
with three of the items in the “excellent” range, four in the 
“good” range, two items in the “fair” range, and one item in 
the “poor” range (see Cicchetti, 1994). The Competence 
subscale was also highly reliable (ICC = .85). In sum, the 
interrater reliability for the items was slightly lower than 
Table 4. Bivariate Correlations Among BiCACS Adherence and Competence Items. 

Measure 1 2 3 4 5 6 7 8 9 10 
I. Rules 74% 52 AG 44% 36% .20** 17 53* AL 27 
2. Clear routines .68** .82** 78 44 AS .28** .22°** .35* .50** 33 
3. Brisk instructional pace .66** 827 .87** A5** 62% 27 24° 34% 56% 35% 
4. Precorrection .63** SH .66** .60** .60** SY ei 31 6 1 38" .30** 
5. Proximity control 50** 56% .60** 52 69%" 29° 43° 2% 39 33° 
6. Preacademic OTR AL 56 57 3H AL Ag .63** .30** .02 .02 
7. Social OTR A5** 5 7e* 65° AD 52% 8 38% 29" .08 05 
8. Behavior specific praise 627° 61 69° 61 52 52 SOE .55* 24 3 1 
9. Corrective feedback 32" 44% 40** 36% 23° 28% 32 .30** .64** 64" 

10. Instructive feedback 44% .50** AT* AT 38% 32 40** 46* .70** .60** 


Note. The top half of the triangle contains correlations among BiCACS Adherence items whereas the bottom half of the triangle contains correlations 
among BiCACS Competence items; the bolded diagonal contains correlations between corresponding BiCACS Adherence and Competence items. BiCACS = BEST in 
CLASS Adherence and Competence Scale; OTR = opportunities to respond. 

*p < .05. **p < .01. 


those reported in Study 1, but interrater reliability for most 
items remained in the acceptable range. Note that the single-rater 
ICC yields a lower reliability estimate than the 
average-rater ICC used in Study 1. Thus, these findings 
suggest that coaches can produce reliable ratings using the 
BiCACS under typical training conditions. 
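
The gap between single-rater and average-rater estimates follows from the Spearman-Brown relation that links the two ICC forms in the Shrout and Fleiss (1979) framework. The sketch below is illustrative only: the function names are hypothetical, and the choice of k = 2 raters is an assumption made for the example, not a description of how the reported values were obtained.

```python
def average_rater_icc(single_icc, k):
    """Reliability of the mean of k raters, given the single-rater ICC."""
    return k * single_icc / (1 + (k - 1) * single_icc)


def single_rater_icc(average_icc, k):
    """Single-rater ICC implied by the reliability of the mean of k raters."""
    return average_icc / (k - (k - 1) * average_icc)


# A single-rater ICC of .64 corresponds to about .78 for the mean of two raters.
print(round(average_rater_icc(0.64, 2), 2))
```

Because the average-rater value is never lower than the single-rater value when the ICC is positive, item-level estimates from the two studies are not directly comparable without this kind of conversion.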

Next, we examined the amount of interitem overlap 
among the adherence and competence items (see Table 4). 
The interitem correlations among the adherence items were 
moderate to strong in strength and in the expected direction. 
This suggests that the teachers tended to use the 
interventions together and that none of the items were redundant (i.e., no r > 
.85). The interitem correlations among the competence 
items were also moderate to strong in strength and in the 
expected direction. In fact, the competence items evidenced 
stronger correlations than the adherence items, though none 
of the items were redundant. The correlations between the 
adherence and competence items were moderate (Social 
Opportunities to Respond, r = .38, p < .001) to strong (Clear 
Routines, r = .82, p < .001). The strength of these correlations
is similar to the amount of overlap observed between the
Adherence and Competence subscales (r = .71, p < .001).
The correlations between corresponding Adherence and 
Competence items were generally stronger than the correla- 
tions between noncorresponding items on the Adherence 
and Competence subscales, which suggests that the adher- 
ence and competence ratings covaried. This finding is con- 
sistent with the BEST in CLASS program, which aims to 
increase the dosage and quality of the specific interven- 
tions. In sum, the interitem correlations were in the expected 
direction and generally support the construct validity of the 
items and subscales. The findings do, however, suggest that 
there was moderate to strong overlap among the BiCACS 
Adherence and Competence items and subscales. 

The relation of the BiCACS Adherence and Competence
subscales to standardized measures of the student-teacher
relationship is also relevant to the discriminant validity of
the subscales. We used the STRS Closeness (STRS-CL) and 
Conflict (STRS-CO) scales to represent the student-teacher 
relationship. The Adherence subscale evidenced a strong
positive correlation to the STRS-CL (r = .51, p = .026) and
a small negative correlation to the STRS-CO (r = -.13, p =
.578). The Competence subscale demonstrated a similar
pattern; the Competence subscale had a moderate positive
relationship with the STRS-CL (r = .43, p = .065), and a
small negative relationship with the STRS-CO (r = -.18,
p = .474). These correlations are in the expected direction,
consistent with past treatment integrity research, and sup- 
port the discriminant validity of the Adherence and 
Competence subscales (Carroll et al., 2000; Hogue et al., 
2008). 
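
The p values attached to these correlations depend on the sample size, which can be checked with the standard test statistic t = r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below is a rough check, not the authors' analysis: the sample size of 19 is an assumption (it is not stated in this passage), the helper names are hypothetical, and the p value is obtained by simple numerical integration of the t density rather than by a statistics package.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t: float, df: int, steps: int = 2000) -> float:
    """Two-tailed p value via Simpson's rule on [0, |t|]; steps must be even."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)


def correlation_test(r: float, n: int):
    """Return (t, p) for the null hypothesis that the population correlation is 0."""
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1 - r * r)
    return t, two_tailed_p(t, df)


N_ASSUMED = 19  # assumption: the sample size is not stated in this passage
reported = [("Adherence with STRS-CL", 0.51), ("Adherence with STRS-CO", -0.13),
            ("Competence with STRS-CL", 0.43), ("Competence with STRS-CO", -0.18)]
for label, r in reported:
    t, p = correlation_test(r, N_ASSUMED)
    print(f"{label}: r = {r:+.2f}, t({N_ASSUMED - 2}) = {t:.2f}, p = {p:.3f}")
```

Because the correlations are rounded to two decimals, the recomputed p values can only approximate the reported ones. With a sample this small, even a correlation of .43 does not reach the .05 threshold.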

Finally, we examined the construct validity of the 
BiCACS Adherence and Competence subscales. As con- 
struct validity cannot be assessed directly, an indirect 
approach was used to determine whether the measure could 
identify expected differences within groups (Lambert & 
Hill, 1994). We evaluated whether the Adherence and 
Competence subscale scores could distinguish between dif- 
ferent phases of the BEST in CLASS program. It was 
expected that Adherence and Competence would increase 
over time due to the sequential introduction of modules 
with weekly performance-based coaching. For this analysis, 
treatment was characterized as being comprised of four 
phases. The first phase represented all baseline data collec- 
tion (3 weeks), the second represented the first half of treat- 
ment (Weeks 1-7 of the program), the third represented the 
second half of treatment (Weeks 8-14 of the program), and 
the last time point represented maintenance data collected 1
month after the end of the program (3 weeks). For the
Adherence subscale, our analyses indicated the scores var- 
ied across the four phases, F(3, 72) = 10.03, p < .001. 
Independent-sample t tests showed the following:
Table 5. BiCACS Adherence and Competence Subscale Scores
Across Treatment Phases.

             Phase 1       Phase 2       Phase 3       Phase 4
Subscale     M (SD)        M (SD)        M (SD)        M (SD)
Adherence    3.78 (1.01)   4.55 (0.75)   5.21 (0.71)   4.91 (0.90)
Competence   5.19 (0.97)   5.08 (0.72)   5.42 (0.70)   5.37 (0.89)


Note. BiCACS = BEST in CLASS Adherence and Competence Scale. 


Adherence ratings were higher at Phase 3 than Phase 2
(Ms = 5.21 and 4.55, SDs = 0.71 and 0.75, t(36) = 2.79, p =
.008) and Phase 1 (Ms = 5.21 and 3.78, SDs = 0.71 and 1.01,
t(36) = 5.06, p < .001); Adherence ratings were higher at
Phase 4 than Phase 1 (Ms = 4.91 and 3.78, SDs = 0.90 and
1.01, t(36) = 3.64, p < .001); Adherence ratings were higher
at Phase 2 than Phase 1 (Ms = 4.55 and 3.78, SDs = 0.75 and
1.01, t(36) = 2.67, p = .007). For the Competence subscale,
our test indicated the scores did not vary across the treat- 
ment phases, F(3, 72) = 0.71, p = .547 (see Table 5). These 
findings suggest that the scores on the Adherence subscale 
are sensitive across treatment phases in a direction that 
would be expected given the sequential introduction of the 
components of the BEST in CLASS program; the scores on 
the Competence subscale did not indicate the same level of 
sensitivity. 
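
The phase comparisons can be approximately reproduced from the means and standard deviations in Table 5. The following sketch is a check from summary statistics only, not the original analysis: the group size of 19 per phase is an assumption inferred from the reported degrees of freedom, t(36) and F(3, 72), and the function names are hypothetical.

```python
import math

# (mean, SD) per phase, taken from Table 5.
ADHERENCE = [(3.78, 1.01), (4.55, 0.75), (5.21, 0.71), (4.91, 0.90)]
COMPETENCE = [(5.19, 0.97), (5.08, 0.72), (5.42, 0.70), (5.37, 0.89)]
N_PER_PHASE = 19  # assumption inferred from the reported degrees of freedom


def anova_f(groups, n):
    """One-way ANOVA F statistic from equal-n group means and SDs."""
    k = len(groups)
    grand_mean = sum(m for m, _ in groups) / k
    ms_between = n * sum((m - grand_mean) ** 2 for m, _ in groups) / (k - 1)
    ms_within = sum(sd ** 2 for _, sd in groups) / k
    return ms_between / ms_within


def t_equal_n(group_a, group_b, n):
    """Independent-samples t (pooled variance) for two groups of equal size n."""
    (m_a, sd_a), (m_b, sd_b) = group_a, group_b
    return (m_a - m_b) / math.sqrt((sd_a ** 2 + sd_b ** 2) / n)


print(f"Adherence  F(3, 72) = {anova_f(ADHERENCE, N_PER_PHASE):.2f}")
print(f"Competence F(3, 72) = {anova_f(COMPETENCE, N_PER_PHASE):.2f}")
for later, earlier in [(3, 2), (3, 1), (4, 1), (2, 1)]:
    t = t_equal_n(ADHERENCE[later - 1], ADHERENCE[earlier - 1], N_PER_PHASE)
    print(f"Adherence Phase {later} vs Phase {earlier}: t(36) = {t:.2f}")
```

Under the equal-group-size assumption, the recomputed statistics fall close to the reported values; small differences are expected because the table entries are rounded to two decimals.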


Discussion 


The purpose of this study was to describe the development 
of the BiCACS and report preliminary psychometric data 
for the measure. The analyses provide evidence to support 
the initial reliability and validity of the BiCACS. The 
BiCACS items and subscales demonstrated fair to strong 
reliability. The findings also generally supported the valid- 
ity of the BiCACS: The pattern of correlations among the 
items and subscales was in the expected direction and the 
subscales were distinct from a teacher-report measure of the 
child-teacher relationship. The analyses also indicated that
the Adherence subscale is sensitive to changes in adherence 
over the course of the BEST in CLASS program. Thus, the 
BiCACS appears to be a promising treatment integrity mea- 
sure that may contribute to efforts designed to evaluate and 
implement the BEST in CLASS program in early childhood 
classrooms. 

Results from the current study are an important first step 
in establishing the psychometric properties of the BiCACS. 
Our findings suggest that trained coders and coaches can 
achieve adequate reliability at the item level with the 
BiCACS across videotaped and live observations. While 
the reliability of integrity measures for teacher-delivered 
programs targeting young children has not been reported, 
the interrater reliability of the BiCACS was comparable
with the reliability at the item (e.g., Barber, Mercer,
Krakauer, & Calvo, 1996; Hogue et al., 2008) and subscale 
(e.g., Carroll et al., 2000) level for treatment integrity mea- 
sures in the psychotherapy field. Importantly, we evaluated 
the reliability of the BiCACS across different conditions 
(videotaped and live observations) and observer types 
(trained coders and coaches). This means that the BiCACS 
may be flexible enough to be reliably used by different 
types of observers, which bodes well for its use in efficacy 
and effectiveness research. 

Compared with the adherence items and subscale, the
interrater reliability for the competence items and subscale
was lower. This is consistent with previous findings (e.g.,
Hogue et al., 2008), suggesting that competence may be 
harder to code than adherence to a treatment protocol. 
Although few studies have attempted to measure compe- 
tence in the prevention field (see Hagermoser Sanetti et al., 
2012; Webb, DeRubeis, & Barber, 2010), researchers have 
begun to recognize that competence may have an important 
relation to treatment outcomes (Harn, Parisi, & Stoolmiller, 
2013). There is considerable debate within the field about 
what level of training is needed for coders to rate compe- 
tence (e.g., Southam-Gerow & McLeod, 2013; Waltz, 
Addis, Koerner, & Jacobson, 1993). Some assert that model 
experts should code competence whereas others argue that 
trained graduate students are capable of coding compe- 
tence. Even though undergraduate and graduate students 
and trained coaches were able to reliably code competence 
in the present study, it is possible that model experts may 
have obtained higher interrater reliability estimates. Thus, 
more empirical work is needed to determine what level of 
training is needed to generate reliable and valid competence 
ratings. 

However, it is important to note that these reliability esti- 
mates should be considered conservative. We used a mea- 
sure designed to code the integrity of the BEST in CLASS 
program to rate observations of teachers only delivering 
this program. This within-condition approach can limit 
variability in intervention delivery and thus represents a 
conservative approach to estimating reliability (Startup & 
Shapiro, 1993). Taken together, the results reported across 
both studies suggest that the BiCACS demonstrates ade- 
quate reliability. 

Our findings suggest that trained coders and coaches can 
reliably use the BiCACS to code videotaped and live obser- 
vations, which has important implications for future research 
applications of the measure. These data indicate that the 
BiCACS can be used by (a) trained coders as a manipulation 
check to aid interpretation of findings, and (b) coaches as a 
tool for informing teacher training efforts. Reliability at the 
item level makes it possible for researchers to investigate 
process-outcome relations at the intervention level, which
could help identify the core ingredients of the BEST in 
CLASS program. Researchers have highlighted the need for 
reliable measures of integrity to advance the science of prevention
(Durlak, 2010; Hagermoser Sanetti & Kratochwill,
2009; Wolery, 2011), and the reliability data at the subscale 
and item level for the BiCACS are a promising step in this 
direction. 

Our findings also provide preliminary support for the 
validity of the BiCACS items and subscales. At the item 
level, the associations among the Adherence and 
Competence items were all in the expected direction, sup- 
porting the construct validity of the items. At the subscale 
level, the Adherence and Competence subscales demon- 
strated overlap with a measure of the quality of the 
child-teacher relationship in the expected direction (Hogue
et al., 2008). Moreover, the magnitude of the relationship 
between the subscales and the affective scale was consistent 
with past research investigating this association (Carroll 
et al., 2000). These findings therefore provide preliminary 
support for the discriminant validity of the BiCACS 
subscales. 

We also investigated the amount of overlap among the 
Adherence and Competence items and subscales. Across 
the two studies, the Adherence and Competence items and 
subscales evidenced moderate to strong overlap. These 
findings are consistent with past studies that have examined 
these associations at the item (Hogue et al., 2008) and sub- 
scale level (Carroll et al., 2000). Moreover, these findings 
suggest that the adherence and competence of delivery of 
the BEST in CLASS interventions were associated in the 
expected direction. 

However, it is important to note that theoretical and 
empirical work has not clarified the amount of overlap 
between the adherence and competence integrity compo- 
nents (see Barber et al., 2007). The degree of overlap 
between adherence and competence in the psychotherapy 
field has ranged from moderate (rs = .31 to .40; Carroll 
et al., 2000) to high (rs = .77 to .90; Barber et al., 1996).
Moreover, the relationship between these components in 
school-based prevention work remains unclear as well 
(Hagermoser Sanetti & Kratochwill, 2009; Schulte et al., 
2009). The Adherence and Competence subscales evi- 
denced moderate to strong overlap, which is to be expected 
given the focus of BEST in CLASS on increasing the fre- 
quency of component delivery as well as the quality of 
implementation. Of course, we cannot rule out the potential
impact response bias may have had on our findings; however,
our findings do suggest that the strength of the relation
may vary across observer type.

Our findings also indicate that the Adherence subscale 
may be capable of measuring variability in treatment 
implementation. Scores on the Adherence subscale 
increased across time, which is to be expected given the 
sequential introduction of coaching associated with learn- 
ing modules, providing some preliminary evidence that the 
subscale can measure variability in treatment adherence.
However, the Competence subscale did not demonstrate
significant increases across time.

Our findings supporting the validity of the BiCACS sub- 
scales add to the literature in a few important ways. Few 
studies have examined the validity of treatment integrity 
measures, so the current findings represent an initial step to 
address this issue. Moreover, the ability to assess variability 
is particularly important when interpreting findings from a 
randomized controlled trial. Simply put, it is important to 
determine whether treatment adherence varies across time, 
teachers, or schools as this may account for differences in 
outcomes (Hagermoser Sanetti, Gritter, & Dobey, 2011; 
Wolery, 2011). Being able to identify variability in treat- 
ment integrity can also inform teacher training efforts, espe- 
cially when the integrity measure is used by coaches as part 
of a training system. Therefore, our findings help demon- 
strate that observational integrity measures used in school 
settings can be sensitive to changes in treatment adherence 
over time, which means that the measures may have some 
utility in implementation research. 

Although our study has a number of strengths, there are 
a few limitations that should be kept in mind as readers 
interpret the results. First, in addition to having a small sam- 
ple, our videotaped sample was not randomly selected as 
the collection of these videotapes was part of the program 
development process. While we have no reason to believe 
that our videotaped sample was not representative of imple- 
mentation of BEST in CLASS by all teachers in Years 2 and 
3 of the development project, readers should take this into 
consideration as they interpret our findings. Second, the 
data reported in this article are from an integrity measure 
developed to assess adherence and competence for BEST in 
CLASS and therefore are not generalizable to other pro- 
grams. That said, the process used to develop the BiCACS 
as well as the preliminary findings may be of use to 
researchers developing integrity measures for their own 
early intervention programs that target reduction in problem 
behavior of young, high-risk children. Third, Study 2 had 
secondary observations conducted on approximately 19%
of the total observations, which is below the recommended
minimum of 20% of total observation sessions
(Kennedy, 2005). Finally, in the BEST in CLASS develop- 
ment project we did not have comparison classrooms. 
Therefore, we were not able to establish the criterion valid- 
ity of the BiCACS or provide meaning to the BiCACS 
scores by comparing them with levels in business-as-usual 
classrooms. 

In sum, as the field increasingly moves toward effective- 
ness research, integrity measures are needed to guide inter- 
pretation of findings, thereby helping to identify 
implementation and evaluation barriers to transporting 
EBPs to new settings, as well as identifying active ingredi- 
ents of EBPs. The development of the BiCACS is a promis- 
ing first step in this direction, and hopefully will serve as a 
stepping stone to further development work in integrity
research in early childhood classrooms as well as provide a 
blueprint for other researchers conducting prevention 
research in classroom settings. 